Background and Context

Service businesses such as banks have to worry about churn, i.e. customers leaving for another service provider. It is important to understand which aspects of the service influence a customer's decision to leave, so that management can concentrate its improvement efforts on those priorities.

Objective

Given a bank customer, build a neural-network-based classifier that predicts whether they will leave within the next 6 months.

Data Description

The case study uses an open-source dataset from Kaggle. The dataset contains 10,000 sample points with 14 distinct features such as CustomerId, CreditScore, Geography, Gender, Age, Tenure, Balance, etc.

Data Dictionary:

Import Libraries

Import data

Read in data and review samples

Read in data

Review Sample data

Insights:

Check Shape of Data, Check data types and number of non-null values for each column.

Insights:

Insights:

Change the object-type variables to categorical variables to save memory
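As a minimal sketch of this conversion, assuming the data sits in a pandas DataFrame: the toy frame below stands in for the bank data, with made-up values and only two of the text columns. Comparing memory before and after shows why the category dtype helps when a column has few distinct values.

```python
import pandas as pd

# Toy frame standing in for the bank data. Column names follow the data
# description; the values are made up for illustration.
df = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"] * 250,
    "Gender": ["Female", "Male", "Male", "Female"] * 250,
    "Age": [42, 41, 39, 50] * 250,
})

before = df.memory_usage(deep=True).sum()

# Convert the text columns to the pandas category dtype, which stores each
# distinct label once plus a small integer code per row.
cat_cols = ["Geography", "Gender"]  # assumed list of text columns
for col in cat_cols:
    df[col] = df[col].astype("category")

after = df.memory_usage(deep=True).sum()
print(f"memory before: {before} bytes, after: {after} bytes")
```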

Review counts of the category variables

Insights:

Insights:

Check unique values

Insights:

Summary of Data

Insights:

Insights:

Univariate analysis

Create Histograms and Boxplots

Insights:

Insights:

Insights:

Insights:

Insights:

Insights:

CDF plot of numerical variables
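A CDF plot shows, for each value, the fraction of observations at or below it. The sketch below computes an empirical CDF with NumPy; the helper name `empirical_cdf` and the ages are illustrative assumptions, not taken from the dataset.

```python
import numpy as np

def empirical_cdf(values):
    """Return the sorted values and the fraction of observations <= each one."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

ages = [25, 31, 38, 38, 44, 52, 67]  # made-up ages
x, y = empirical_cdf(ages)
for value, frac in zip(x, y):
    print(f"{value:5.0f}  {frac:.2f}")
```

With matplotlib available, `plt.step(x, y, where="post")` would draw the curve from these two arrays.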

Insights:

Review Categorical Variables

Function to create Bar Charts of percentages
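One possible shape for such a function is sketched below. It only computes the percentages that the bar chart would display; the helper name `category_percentages` and the sample values are assumptions for illustration.

```python
import pandas as pd

def category_percentages(series):
    """Percentage of rows in each category, largest first."""
    return (series.value_counts(normalize=True) * 100).round(1)

# Made-up sample: 5 customers in France, 3 in Germany, 2 in Spain.
geography = pd.Series(["France"] * 5 + ["Germany"] * 3 + ["Spain"] * 2)
pct = category_percentages(geography)
print(pct)
```

Calling `pct.plot.bar()` on the result would draw the chart when matplotlib is installed.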

Insights:

Insights:

Insights:

Insights:

Insights:

Insights:

Bivariate analysis

Insights:

Review Correlation Heat Map

Insights:

Review crosstabs of categorical variables with the proportion of Exited
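A crosstab of this kind can be sketched with `pandas.crosstab`, where `normalize="index"` turns the counts in each row into proportions. The rows below are made up; only the column names Geography and Exited come from the notebook.

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "Geography": ["France", "France", "Germany", "Germany", "Spain", "Spain"],
    "Exited":    [0, 0, 1, 0, 0, 1],
})

# normalize="index" makes every row sum to 1, so column 1 is the share
# of customers in that category who exited.
exit_rate = pd.crosstab(df["Geography"], df["Exited"], normalize="index")
print(exit_rate)
```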

Insights:

Insights:

Insights:

Insights:

Insights:

Insights:

Review the continuous variables using distribution plots

Insights:

Insights:

Insights:

Insights:

Insights:

Insights:

Insights:

Insights:

Insights:

Create customer profile by Exited status

Additional Deep Dives

Since Age, Balance and Geography appear to be predictive variables, a deeper dive into understanding them is required.

Age

Insights:

Balance

Insights:

Geography

Insights:

Data Pre-processing

Prepare the data for analysis: missing value treatment, outlier detection (treat if needed, and explain why or why not), feature engineering, and preparing the data for modeling

Insights:

Create Maps to the bins for different variables

Remap Variables

One-hot encoding for selected categorical variables
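A minimal sketch of one-hot encoding with `pandas.get_dummies` follows. The choice of Geography and Gender as the encoded columns, and the use of `drop_first=True` to drop one redundant level per variable, are assumptions for illustration. The final check mirrors the next step, confirming every column is numeric.

```python
import pandas as pd

# Made-up rows; column names follow the data description.
df = pd.DataFrame({
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Female", "Male"],
})

# Replace each listed column with 0/1 indicator columns. drop_first=True
# removes the first level of each variable, which is implied by the others.
encoded = pd.get_dummies(
    df, columns=["Geography", "Gender"], drop_first=True, dtype=int
)

# Verify that every column is numeric before model building.
all_numeric = all(pd.api.types.is_numeric_dtype(t) for t in encoded.dtypes)
print(encoded)
print("all numeric:", all_numeric)
```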

Check to see all the data types are numeric before model building

Insights:

Drop unneeded variables

Insights:

Split the target variable and predictors - Split the data into train and test - Rescale the data

Insights:

Splitting the Data into train and test set
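The split and rescale steps could look like the sketch below, assuming scikit-learn is used. The data is synthetic, and the 80/20 split, the stratification, and the choice of `StandardScaler` are assumptions rather than the notebook's confirmed settings. The scaler is fitted on the training set only so that no information from the test set leaks into training.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 11))            # 11 synthetic predictors
y = (rng.random(1000) < 0.2).astype(int)   # synthetic target, made-up positive rate

# stratify=y keeps the share of positives similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics on test

print(X_train_scaled.shape, X_test_scaled.shape)
```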

Create Model

Model 1: basic DNN Model

Model 1 will be a basic model with one hidden layer of 8 nodes. Since this is a binary classification problem, the output layer will have a single node.
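To make the architecture concrete, here is a framework-agnostic NumPy sketch of the forward pass only: 11 inputs (the number of independent variables given in the final summary), 8 ReLU hidden nodes, and one sigmoid output node. The weights are random placeholders rather than trained values, and the names `W1`, `b1`, `W2`, `b2` and `forward` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 11, 8

# Randomly initialised placeholder weights; training would update these.
W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """Forward pass: ReLU hidden layer, then a sigmoid output probability."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU activation
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid squashes to (0, 1)

X_batch = rng.normal(size=(5, n_features))  # 5 synthetic customers
probs = forward(X_batch)
n_params = W1.size + b1.size + W2.size + b2.size
print(probs.ravel(), n_params)
```

The single sigmoid output is read as the predicted probability of exiting, which is later compared with a threshold to produce the 0/1 prediction.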

Training [Forward pass and Backpropagation]

Plotting the train and test loss

Evaluation

Let's evaluate the first model against the test data

The model did better than if we had guessed zero for all of the predictions. Also, there doesn't appear to be any severe overfitting.

Functions to create Confusion Matrix and Scoring Metrics
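Such functions can be sketched without any modeling library. The helper names `confusion_counts` and `scores` and the example labels below are assumptions for illustration; the class coded 1 is treated as the positive class (Exited).

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn

def scores(y_true, y_pred):
    """Accuracy, recall, precision and F1 from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Made-up labels: 3 actual exiters, 5 actual stayers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
result = scores(y_true, y_pred)
print(confusion_counts(y_true, y_pred), result)
```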

Insights:

Review ROC Curve to see the optimal threshold
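One common way to pick a threshold from the ROC curve is to maximise the geometric mean of sensitivity (TPR) and specificity (1 - FPR), since both can be read directly off the curve. The sketch below assumes that criterion and scikit-learn's `roc_curve`; the labels and probabilities are made up, and the exact criterion used in the notebook may differ.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up labels and predicted probabilities for 8 customers.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Geometric mean of sensitivity and specificity at each candidate threshold.
gmeans = np.sqrt(tpr * (1 - fpr))
best = int(np.argmax(gmeans))
best_threshold = thresholds[best]

# Predict "exit" whenever the probability reaches the chosen threshold.
y_pred = (y_prob >= best_threshold).astype(int)
print(f"best threshold: {best_threshold:.2f}, G-mean: {gmeans[best]:.3f}")
```

In this toy example the selected threshold falls below the default of 0.5, so more customers are flagged as likely to exit and recall rises at the cost of extra false positives.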

Insights:

Insights:

A balance of recall and precision is the goal. The reason is that we don't want to spend too much money on retention marketing by targeting everyone, but we also don't want to miss the customers who are at high risk of exiting.

Model 2

Training [Forward pass and Backpropagation]

Plotting the train and test loss

Evaluation

Let's evaluate the second model against the test data

Review ROC Curve to see the optimal threshold

Conclusion

Confusion Matrix

True Positive (observed=1,predicted=1):

Predicted Exiter and the customer is actually an Exiter.

False Positive (observed=0,predicted=1):

Predicted Exiter and the customer is not an Exiter.

True Negative (observed=0,predicted=0):

Predicted not an Exiter and the customer is not an Exiter.

False Negative (observed=1,predicted=0):

Predicted not an Exiter and the customer is an Exiter.

Important Metric

The important metric is recall. The bank wants to keep as many customers as possible, which means we want as few false negatives as possible. In other words, we want to target as many as possible of the customers who are likely to leave. That said, we also need to keep false positives in check, as we don't want to spend too much money on marketing. The focus should be on recall since it minimizes false negatives.

Important Features

Geography (country), Age and Balance are important features in this data.

Final Summary

The analysis used 11 independent variables. The first DNN model used one hidden layer with 8 nodes, with ReLU activation for the hidden layer and sigmoid for the output layer. The initial accuracy of the model was 86% with a recall of 48%. Because the recall was low (that is, false negatives were high), the ROC threshold was optimized using a geometric mean of sensitivity and precision. With the new threshold, the model's recall improved from 48% to 76%.

Even though this was a big improvement, a second model was built using class weights to handle the imbalanced data, and the number of hidden layers was increased to two. The second model's accuracy was 79% and its recall was 74% without threshold optimization, a substantial improvement in recall over the first model without threshold optimization. With an optimized threshold, the second model had an accuracy of 80% and a recall of 74%. Overall, there wasn't a dramatic improvement between the models. With only 10k observations, it is hard to develop the best model. I would recommend trying other methods, such as logistic regression or an ensemble method, and gathering more data.
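As a sketch of how class weights for imbalanced data can be derived, the snippet below uses the "balanced" heuristic, n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for `class_weight="balanced"`. Whether the second model used exactly these weights is an assumption, and the labels are made up.

```python
import numpy as np

# Made-up target: 8 customers stayed (0), 2 exited (1).
y_train = np.array([0] * 8 + [1] * 2)

classes, counts = np.unique(y_train, return_counts=True)

# Balanced heuristic: rarer classes receive proportionally larger weights.
weights = len(y_train) / (len(classes) * counts)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

Frameworks such as Keras accept a dictionary of this form through the `class_weight` argument of `fit`, so that errors on the minority class count more in the loss.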

Business Recommendations